BMC Bioinformatics — Latest Matching Preprints

1

Reproducible-by-design: Romics Processor, a FAIR ecosystem for multi-omics and spatial-omics analysis

Gorman, B. L.; Bhotika, H.; Jehrio, M.; Purkerson, J. M.; Carlin, F.; Nakayasu, E. S.; Misra, R. S.; Adkins, J. N.; Anderton, C. R.; Pryhuber, G.; Clair, G. C.

2026-07-15 bioinformatics 10.64898/2026.07.09.737600 medRxiv

Top 3%

2.4%

Multi-omics and spatial-omics technologies are exploding in use, producing increasingly complex datasets. Existing bioinformatics tools are developing rapidly but fail to fully enforce the FAIR principles, leaving the field vulnerable to escalating issues in computational reproducibility. Here, we introduce a reproducible-by-design paradigm represented in an omics data processing package, RomicsProcessor. At its core is the "Romics_object", a self-contained digital artifact that encapsulates the full history of the data from the original data to the fully processed state, capturing the details of the transformative steps and the required dependencies. This architecture ensures that computational workflows are fully portable and reproducible. In this manuscript, we demonstrate RomicsProcessor's computational capabilities and scalability on diverse datasets, including bulk proteomics, large-scale multiplexed immunofluorescence, and multi-batch mass spectrometry imaging. Providing a robust framework for analysis grounded in the FAIR Data Principles, RomicsProcessor is a blueprint for the next generation of reproducible bioinformatics tools that can dramatically accelerate discovery in multi-omics biology in the era of artificial intelligence.

2

Gene Regulatory Networks that support Multi-Fate Cellular Decisions

BV, H.; Adigwe, S.; Jolly, M. K.; Gedeon, T.

2026-07-15 systems biology 10.64898/2026.07.13.738161 medRxiv

Top 4%

2.1%

Cell fate decisions are driven by gene regulatory networks (GRNs). While the mutually inhibitory toggle switch effectively models binary fate decisions, fully connected inhibitory networks with more than two nodes fail to capture multi-fate decisions due to the low prevalence of "single high states", where only a single master regulator is highly expressed. The goal of this study is to find network structures that support all single high states. We find that the only network that attains the highest possible prevalence of all single high states within the set of monotone Boolean (MB) models is completely disconnected. Since biological networks typically require connectivity, we investigate network structures that support equipotency, where all single high states have equal prevalence within MB models. Finally, we characterize the networks that support multistability between all single high states, finding that it is possible only in networks in which each node either has self-activations or is inhibited by every other network node. Our findings provide a theoretical framework for understanding the network design principles that can support simultaneous differentiation into multiple distinct cell types.
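
As a minimal sketch of the single-high-state idea, the Python snippet below enumerates the fixed points of a small fully inhibitory Boolean network. The NOR update rule (a node is ON only when all of its inhibitors are OFF) and the helper names `fully_inhibitory`, `step`, and `fixed_points` are illustrative assumptions, not the authors' models; NOR is only one of the rules compatible with purely inhibitory edges.

```python
from itertools import product


def fully_inhibitory(n):
    """Wiring in which every node is inhibited by every other node (hypothetical helper)."""
    return [[j for j in range(n) if j != i] for i in range(n)]


def step(state, inhibitors):
    """Synchronous update under an assumed NOR rule: ON only if all inhibitors are OFF."""
    return tuple(
        int(not any(state[j] for j in inhibitors[i]))
        for i in range(len(state))
    )


def fixed_points(inhibitors):
    """Enumerate all Boolean states that the update maps to themselves."""
    n = len(inhibitors)
    return [s for s in product((0, 1), repeat=n) if step(s, inhibitors) == s]


if __name__ == "__main__":
    for s in fixed_points(fully_inhibitory(3)):
        print(s)
```

Under this particular rule the three-node network has exactly the three single high states as fixed points. Other rules on the same wiring (for example, a node that switches OFF only when all inhibitors are ON) give different fixed points, which is why prevalence across the whole model set matters.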

3

In Silico Trial Simulation with Artificial Intelligence-Generated Synthetic Control Cohorts Reproduces Results of a Randomized Controlled Trial in Acute Myeloid Leukemia

Kumar Reddy, K.; Hahn, W.; Winter, S.; Roellig, C.; Mueller-Tidow, C.; Serve, H.; Baldus, C. D.; Fransecky, L.; Schliemann, C.; Burchert, A.; Schaefer-Eckart, K.; Kaufmann, M.; Schetelig, J.; Bornhaeuser, M.; Middeke, J. M.; Eckardt, J.-N.

2026-07-16 health informatics 10.64898/2026.07.15.26358123 medRxiv

Top 5%

1.1%

Rising costs, slow accrual and molecular substratification of cancers necessitate novel clinical trial designs. We demonstrate that artificial intelligence-generated synthetic patients can replace real controls to reproduce results of the SORAML trial. Using external multimodal data from 1,377 acute myeloid leukemia (AML) patients from previous trials and a real-world registry, we fine-tuned a tabular foundation model to generate synthetic patients, reproducing clinical and genetic features and outcome associations. Synthetic patients were then matched to the original SORAML intervention group using Cox risk scores, replacing the original control and reproducing the original trial result with near-identical median event-free survival (EFS) and treatment effect (original hazard ratio [HR] 0.64, 95%-confidence interval [CI] 0.47-0.87, p=0.004; with synthetic control HR 0.66, 95%-CI 0.48-0.90, p=0.009). Our findings demonstrate that AI-generated synthetic patients can serve as statistically rigorous controls supporting novel trial designs.

4

Analytical perturbation reveals hidden instability of biological phenotypes

Piorkowska, N. J.; Ostromecki, A.; Franik, G.; Bizon, A.

2026-07-16 endocrinology 10.64898/2026.07.13.26357916 medRxiv

Top 5%

1.0%

Background: Unsupervised machine learning has become a cornerstone of computational phenotyping across clinical medicine, genomics, imaging, and multi-omics research. However, phenotype discovery relies on a sequence of analytical decisions - including missing-data handling, preprocessing, dimensionality reduction, clustering methodology, and stochastic initialization - that are rarely evaluated collectively. Although clustering stability has been extensively investigated, the robustness of complete analytical workflows remains largely unexplored. Results: We developed an Analytical Perturbation Framework that systematically quantifies the robustness of phenotype discovery by perturbing complete unsupervised learning workflows rather than individual clustering algorithms. Using a real-world cohort of 1,286 women with polycystic ovary syndrome (PCOS), we generated 116 valid analytical pipelines comprising alternative preprocessing strategies, missing-data handling methods, dimensionality reduction approaches, clustering algorithms, and random initializations. Agreement between independently generated phenotype solutions was consistently low (median Adjusted Rand Index = 0.079), indicating substantial sensitivity of phenotype discovery to routine analytical decisions. Variance decomposition identified preprocessing as the largest contributor to phenotype instability (22.8%), followed by clustering methodology (14.6%), whereas stochastic initialization explained only 3.1% of the observed variability. At the patient level, most individuals exhibited reproducible phenotype assignments (median Patient Robustness Score = 0.719), although a substantial subgroup showed markedly lower assignment stability. Feature perturbation analyses identified follicle-stimulating hormone, anti-thyroglobulin antibodies, anti-thyroid peroxidase antibodies, total testosterone, luteinizing hormone, and androstenedione as the strongest contributors to computational robustness, rather than biological importance. Finally, phenotype solutions demonstrating greater computational robustness also exhibited greater biological coherence during independent validation.
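
For readers unfamiliar with the agreement metric quoted above, here is a small pure-Python sketch of the Adjusted Rand Index between two cluster assignments. The function name and the toy label vectors are illustrative; this is the textbook pair-counting formula, not code from the preprint.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs placed together in both labelings, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition, relabeled
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # partitions that cross each other
```

A value near 0, such as the reported median of 0.079, means two pipelines agree on patient groupings only slightly more than chance would predict.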

5

Genetic sensitivity analysis: estimating genetic confounding and environmentally mediated genetic effects using multiple exposures

Frach, L.; Rijsdijk, F.; Hannigan, L. J.; Dudbridge, F.; Pingault, J.-B.

2026-07-17 epidemiology 10.64898/2026.07.16.26358236 medRxiv

Top 6%

0.6%

Polygenic scores are imperfect measures of the additive genetic effects of common genetic variants. The resulting measurement error biases estimates of quantities of interest in epidemiological analyses integrating polygenic scores. For example, how much of an exposure-outcome association is genetically confounded can be substantially underestimated when using polygenic scores alone. Here we present extensions to Gsens, a genetic sensitivity analysis, which aims to correct for such measurement error using both polygenic scores and heritability estimates. Gsens now allows for multiple exposures and estimates several quantities of interest, i.e. genetic confounding, adjusted residual association (net of genetic confounding), genetic overlap and environmentally mediated genetic effects. We present derivations and simulations showing how Gsens accounts for measurement error in the polygenic score; we also show how estimation may be affected by misspecifications of the causal structure between exposures. Applying Gsens in the Norwegian Mother, Father and Child Cohort Study (MoBa), we uncover, among other results, substantial genetic confounding in the associations between multiple known risk factors for attention deficit hyperactivity disorder (ADHD), such as low birth weight and temperament, and measures of ADHD in childhood. The updated Gsens R package offers multiple options, including for missing data handling and customisable syntax. Our extended version of Gsens is applicable to a broad range of substantive questions in multiple disciplines.

6

FoodScribe: an open-source semantic framework for nutrient estimation from free-text dietary records

Gouda, H.; Sala Climent, M.; Agongo, J.; Gaikwad, S. P.; Nattakom, A.; Zhao, H. N.; Xing, S.; Boland, B. S.; Holt, T.; Guma, M.; Dorrestein, P. C.

2026-07-17 nutrition 10.64898/2026.07.15.26358181 medRxiv

Top 6%

0.6%

Efficiently summarizing dietary records at scale remains a persistent bottleneck in nutritional epidemiology. We present FoodScribe, which translates free-text meal descriptions into quantitative nutrient profiles by combining ingredient parsing with nutrient retrieval from the USDA FoodData Central (FDC) database. Benchmarked with three LLM providers on the Nutribench dataset, FoodScribe completed annotation of 3,807 meal descriptions in 2.5 hours, a task otherwise requiring substantial manual effort from trained nutritionists. FoodScribe achieved F1 scores of 0.79-0.89 across macronutrient estimation, with models performing better for protein than for fat estimation. Application to a Mediterranean diet intervention cohort indicated dietary shifts consistent with the intervention pattern based on model-derived estimates. Integration with metabolomics data suggested that fiber and vegetable intake were positively associated with a fecal metabolite cluster.

7

Description of the Intervention for Virological Suppression in Youth with HIV (iVY): A telehealth behavioral intervention focused on mental health, substance use, and HIV care engagement among youth living with HIV

Balaban, C.; McCuistian, C.; Ortega Roque, H.; Gruber, V. A.; Johnson, M. O.; Saberi, P.

2026-07-15 hiv aids 10.64898/2026.07.13.26357907 medRxiv

Top 7%

0.5%

Objective: Youth with HIV experience persistent disparities across the HIV care continuum, including low rates of engagement in care and viral suppression. In a recent national survey, youth and young adults, defined by the CDC as ages 13-34, accounted for approximately 20-40% of new HIV diagnoses in the United States. We describe the Intervention for Virological Suppression in Youth with HIV (iVY), a youth-friendly, tailored approach that integrates mental health and substance use support with HIV treatment engagement. Design: This paper describes the development of the intervention used in iVY, which is currently being evaluated in a randomized clinical trial (RCT) using an adaptive treatment strategy. HIV virological suppression is measured via dried blood spot at 16 weeks. Setting: The intervention is delivered fully remotely across California and Florida. Participants: Youth with HIV (YWH) aged 18-29 who are not durably virally suppressed are enrolled and randomized to the intervention or usual care. The RCT will enroll and randomize 200 participants to the intervention (n = 100) versus usual care (n = 100). Intervention Description: iVY includes: (1) tailored, brief, weekly video-counseling sessions focused on HIV treatment adherence and engagement, mental health, substance use, and related barriers; and (2) a mobile health application designed to support adherence, resource access, and peer connection. Participants who are not virally suppressed receive an additional 16 weeks of intensified intervention, while responders continue with app-based support. Conclusion: This paper provides a detailed description of a telehealth-based behavioral intervention tailored to the needs of youth with HIV. The intervention offers a scalable model for integrating behavioral health and HIV care to address barriers to treatment engagement in this priority population.

8

Computational design of a multi-epitope vaccine against M. tuberculosis

Buhari, A.; Okutu, P.; Oyeleke, U. A.; Sivakumar, A.; Hameed, S. A.

2026-07-15 bioinformatics 10.64898/2026.07.09.737463 medRxiv

Top 7%

0.4%

Background: Tuberculosis remains a leading global infectious killer, with BCG offering inconsistent adult protection and rising drug-resistant strains demanding novel vaccine strategies. We report the first multi-epitope vaccine construct simultaneously targeting three previously unexplored Mycobacterium tuberculosis virulence proteins: EccB3, MycP, and polyketide synthase, which collectively govern nutrient acquisition, ESX secretion integrity, and innate immune evasion. Methods: Using a reverse vaccinology pipeline, B-cell, CTL, and HTL epitopes were predicted, filtered for allergenicity, toxicity, and IFN-γ induction, then assembled into an 823-residue chimeric construct incorporating beta-defensin and PADRE adjuvants with AAY/GPGPG linkers, covering ~90% of global HLA diversity. The construct underwent AlphaFold structure prediction, 3DRefine refinement, disulfide engineering, PROCHECK/ProSA validation, ClusPro 2.0 docking against TLR1/TLR2, and C-IMMSIM immune simulation. Results: The construct (82.3 kDa, instability index 32.48) showed strong structural quality (94.7% favoured Ramachandran residues), stable TLR1/TLR2 binding (weighted energy: -1,371.0 kcal/mol), robust in silico immune responses, and durable memory cell formation following booster simulation. Conclusion: This computationally validated construct represents a promising multi-target TB vaccine candidate warranting experimental advancement.

9

Multimodal gene prioritization reveals nonlinear regulatory architecture in childhood-onset asthma

Huang, N.; Ragsac, M. F.; Gui, X.; Tantisira, K. G.; Amariuta, T.

2026-07-16 genetic and genomic medicine 10.64898/2026.07.14.26357983 medRxiv

Top 7%

0.4%

Asthma is a heritable complex disease that disproportionately burdens minority and admixed populations in the US. However, the causal genes and regulatory mechanisms governing inherited risk remain largely unresolved. We performed a European-ancestry meta-analysis of 141,894 cases and 1,361,846 controls drawn from the Trans-national Asthma Genetic Consortium (TAGC) and Global Biobank Meta-analysis Initiative (GBMI), yielding an estimated h2SNP of 0.056 (SE = 0.0038) and 275 independently associated loci. To enhance mechanistic inference beyond variant-level associations, we developed a multimodal framework to predict asthma risk integrating GWAS summary statistics, bulk tissue expression quantitative trait loci (eQTL) data from the Genotype-Tissue Expression (GTEx) project, and single-cell gene eQTL data from the OneK1K Project. We performed transcriptome-wide association studies (TWAS) and subsequently applied probabilistic fine-mapping with FOCUS to prioritize putative causal genes expressed in bulk tissues and higher resolution immune cell populations. Fine-mapping asthma-associated genes implicated barrier-immune and metabolic-endocrine tissues alongside adaptive T-cell subsets as the primary mediators of asthma genetic risk, resolving canonical CD4+ Th2 effector genes including IL1RL1, TSLP, STAT6, and GATA3. Using these prioritized genes, we constructed a polygenic transcriptome risk score (PTRS) using random forest to integrate gene-level effects across critical tissues and cell types. Evaluated in two ancestrally distinct pediatric asthma cohorts, the Childhood Asthma Management Program (CAMP) and the Genetics of Asthma in Costa Rica Study (GACRS), our PTRS demonstrated improved transferability over the standard variant-level and gene-level baseline models. 
While modest common variant heritability limits the discriminative power of our models, we estimated a theoretical maximum achievable area under the receiver operating characteristic (AUROC) curve of 0.64. Our integrative nonlinear model of PRS-CSx and cross-modal (bulk tissue and single cell) FOCUS PTRS resulted in the best cross-cohort performance (CAMP AUC = 0.632, sd = 0.04, 3.55 case/control odds ratio in top vs. bottom quartiles), representing an increase of +0.118 AUC over PRS-CSx, +0.067 AUC over tissue-specific TWAS pruning and thresholding, and +0.041 AUC over cell-type-specific FOCUS PTRS. Our results demonstrate that modeling nonlinear interactions between variant- and gene-level effects across both bulk tissue and single cell eQTL data improves our ability to determine high-risk individuals and to explain the likely mechanisms driving genetic susceptibility of childhood-onset asthma.

10

LocusBlend: Flexible multi-index regional visualization of genomic association signals

Yang, C.; Cook, N.; Zeng, Y.; Fu, T.; Budde, J.; Cruchaga, C.; Belloy, M. E.

2026-07-21 genetic and genomic medicine 10.64898/2026.07.15.26358129 medRxiv

Top 8%

0.4%

Summary: It has become standard practice to visualize regional signals from genome-wide association studies (GWAS) using LocusZoom plots. Similarly, GWAS signals are compared to regionally matched quantitative trait loci (QTLs), i.e. variant-to-gene regulation data, using LocusCompare plots to aid assessment of candidate trait-related genes. Despite broad usage, these tools annotate variants by linkage disequilibrium (LD) to a single lead or index variant. This single-index representation has limitations for visualizing complex loci that contain multiple independent signals. We present LocusBlend, an interactive web application for multi-index, LD-blended visualization of genomic loci. LocusBlend supports one or two genomic association summary-statistic datasets and one to three index variants, multi-index LocusZoom color-blended plots, and matching LocusCompare visualizations. Applications to Alzheimer's disease GWAS and QTL signals illustrate that LocusBlend enables visualization and separation of independent signals despite shared LD and high genomic complexity. Overall, LocusBlend is aimed at helping researchers handle the continuously expanding complexity of human genomics findings. Availability and Implementation: LocusBlend is freely available at https://locusblend.wustl.edu. Publication-ready plots are generated in 1 min. Source code, documentation, example datasets, input templates, and reproducibility instructions are available at https://github.com/BelloyLab/LocusBlend. LocusBlend is implemented in Python using Streamlit, Plotly, and PLINK. Supplementary Information: Supplementary data are available online.

11

Efficient stochastic epidemic simulation via the Sellke construction

van Boven, M.; Bootsma, M. C.

2026-07-17 epidemiology 10.64898/2026.07.16.26358219 medRxiv

Top 8%

0.3%

Stochastic epidemic models are a cornerstone of infectious disease epidemiology and are often used to study intervention scenarios. However, large run-to-run variability can make intervention effects difficult to estimate precisely. We revisit the epidemic Sellke construction, which assigns each individual an infection threshold for the cumulative infection hazard such that, conditional on the thresholds, the epidemic trajectory becomes deterministic. This enables coupling of simulations with and without an intervention, yielding low-variance effect estimates even when outcomes such as final size or peak incidence vary widely between runs. We develop an exact, event-driven implementation that maintains infection and recovery events in priority queues. Cumulative infection-hazard updates require O(log N) time per event, yielding overall complexity O(E log N) for E events in a population of size N. The implementation achieves computational performance comparable to the classical Gillespie algorithm while naturally accommodating non-Markovian infectious periods and complex infectiousness profiles. We illustrate the approach using distance-dependent spread of avian influenza between poultry farms in the Netherlands and a multilayer population with households, schools, and workplaces. In both examples, coupling enables efficient within-run comparisons of intervention scenarios across stochastic realisations.
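
The coupling idea can be sketched for the simplest case: the final size of a homogeneously mixing SIR epidemic. This is not the event-driven priority-queue implementation the abstract describes. The function names, the single initial infective, and the exponential infectious periods are assumptions made for illustration. Each susceptible draws an Exp(1) threshold, and is infected once the cumulative infection pressure, beta / N times the total infectious time accrued so far, reaches that threshold.

```python
import random


def sellke_final_size(thresholds, sus_periods, init_periods, beta):
    """Number of initial susceptibles eventually infected, given fixed thresholds.

    thresholds[i] and sus_periods[i] belong to susceptible i; init_periods holds the
    infectious periods of the initial infectives. Conditional on these inputs the
    outcome is deterministic.
    """
    n = len(thresholds) + len(init_periods)
    order = sorted(range(len(thresholds)), key=thresholds.__getitem__)
    infectious_time = sum(init_periods)  # total infectious person-time generated so far
    infected = 0
    for i in order:
        # The next-lowest threshold escapes if the total pressure never reaches it.
        if thresholds[i] > beta / n * infectious_time:
            break
        infectious_time += sus_periods[i]
        infected += 1
    return infected


def coupled_final_sizes(n_sus, beta_base, beta_intervention, seed, gamma=1.0):
    """Run baseline and intervention scenarios on the SAME thresholds and periods."""
    rng = random.Random(seed)
    thresholds = [rng.expovariate(1.0) for _ in range(n_sus)]
    sus_periods = [rng.expovariate(gamma) for _ in range(n_sus)]
    init_periods = [rng.expovariate(gamma)]
    base = sellke_final_size(thresholds, sus_periods, init_periods, beta_base)
    treated = sellke_final_size(thresholds, sus_periods, init_periods, beta_intervention)
    return base, treated


if __name__ == "__main__":
    for seed in range(5):
        print(seed, coupled_final_sizes(500, 2.0, 1.2, seed))
```

Because both scenarios share the same random inputs, the difference in final size within each seed reflects only the change in beta, which is the source of the variance reduction the abstract refers to.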

12

Evaluation of four large language models on complex, infectious disease case scenarios

Pradhan, A.; Waxse, B.; Matias, W. R.; Mercaldo, S.; Bowman, K.; Nutt, C.; Kanjilal, S.; Hillis, J. M.

2026-07-15 infectious diseases 10.64898/2026.07.14.26358021 medRxiv

Top 8%

0.3%

Objectives: Large language models (LLMs) are increasingly used in medicine, but evaluation is often on multiple choice questions and management of common conditions. Infectious diseases (ID) can present complex scenarios that require considerations beyond guideline-based responses. We assessed LLM performance in these situations, including against ID-specific criteria that consider infection control or antimicrobial stewardship (AMS). Methods: We evaluated four LLMs (Claude 3.5 Sonnet, GPT-4o, GPT-o1, and a local instance of Llama 3.1 8B) in October 2024, on five complex ID vignettes. The LLM responses were each evaluated for 18 items by two board-certified ID clinicians and pairwise comparisons were performed between LLMs. Results: There was no significant difference in performance between GPT-o1, GPT-4o and Claude Sonnet on general medical criteria, and the three models were comparable in how often they provided an unsafe response (GPT-o1 30%, GPT-4o 40%, Claude 37%) and in how often their responses contained a critical omission (GPT-o1 27%, GPT-4o 43%, Claude 47%). Llama 3.1 8B had significantly decreased performance for most criteria. On ID-specific criteria, GPT-o1 outperformed other models and all models significantly outperformed Llama for interpreting microbiology results, AMS principles, appropriate antimicrobial spectrum and infection control considerations. Performance was poor in secondary prevention and management of risk factors. Conclusions: On complex ID scenarios, LLM responses were variable. The open-source, smaller Llama 3.1 8B model performed poorly and large, non-reasoning models varied, but more than 30% of responses contained a risk of harm or critical omission. These findings suggest caution is required when deploying these models in ID domains without specialist oversight.

13

Reliability-weighted target prioritization in CD4+ T-cell Perturb-seq: a generalizability-theory decomposition

Cheng, C.

2026-07-15 bioinformatics 10.64898/2026.07.13.738312 medRxiv

Top 8%

0.3%

Genome-scale Perturb-seq screens prioritize candidate targets by the strength of a perturbation's transcriptional effect. Effect strength does not answer a prior measurement question: is the readout dependable? A large effect estimated from a single guide, a single donor, or a pseudobulk of few cells need not survive replication, and for target prioritization each false lead costs a validation experiment. We treat each perturbation effect as a measurement in a crossed Target × Guide × Donor × Condition design and apply generalizability theory (Brennan, 2001; Cronbach et al., 1972) to separate the dependable part of an effect from facet-specific idiosyncrasy. Guides and donors enter as random facets; condition enters as a fixed facet and is analyzed within its levels. For each target we report a dependability profile over the facets and a joint generalizability coefficient over the two random facets, and we re-rank targets by effect magnitude weighted by that coefficient. On the released screen (Zhu et al., 2025), removing the measurement-error floor estimated from the non-targeting controls raises the number of genes with a dependable target-signal share above .10 from 40 to 7,674. Analyzed within activation states, dependability recovers the T-cell-receptor signaling module as reliably measurable only in activated cells, without recourse to gene annotation. A design study indicates that reliability is limited by the number of guides rather than the number of donors, so a future screen should add guides. Every methodological decision was recorded and adversarially reviewed, and all results regenerate from the released summary statistics.
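
To make the design-study reasoning concrete, the sketch below uses the standard generalizability coefficient for a two-facet crossed random design (targets crossed with guides and donors). The abstract does not state whether the preprint uses exactly this form, and the variance components, function names, and target names are hypothetical.

```python
def g_coefficient(var_target, var_tg, var_td, var_resid, n_guides, n_donors):
    """Generalizability coefficient for targets crossed with guides and donors.

    var_target: variance of the dependable target effect
    var_tg, var_td: target-by-guide and target-by-donor interaction variances
    var_resid: residual (three-way interaction plus error) variance
    """
    relative_error = (
        var_tg / n_guides
        + var_td / n_donors
        + var_resid / (n_guides * n_donors)
    )
    return var_target / (var_target + relative_error)


def reliability_weighted_rank(effects, coefficients):
    """Order targets by |effect| multiplied by their generalizability coefficient."""
    scores = {t: abs(effects[t]) * coefficients[t] for t in effects}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical components in which guide idiosyncrasy dominates the error.
    components = dict(var_target=1.0, var_tg=2.0, var_td=0.5, var_resid=1.0)
    print(g_coefficient(**components, n_guides=2, n_donors=4))
    print(g_coefficient(**components, n_guides=4, n_donors=4))  # add guides
    print(g_coefficient(**components, n_guides=2, n_donors=8))  # add donors
    print(reliability_weighted_rank({"A": 3.0, "B": 2.0}, {"A": 0.2, "B": 0.9}))
```

When the target-by-guide component is the largest error term, as assumed here, adding guides raises the coefficient more than adding donors, which mirrors the design recommendation in the abstract.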

14

A ReAct Agentic AI System for Natural Language Querying and Statistical Analysis of The Cancer Genome Atlas Clinical Data

Korutla, R.; Amal, S.

2026-07-17 health informatics 10.64898/2026.07.15.26358188 medRxiv

Top 9%

0.3%

The Cancer Genome Atlas (TCGA) holds clinical data for over 11,000 patients across 33 cancer types, but access is hard because of complex file structures, heterogeneous formats, and the need for programming. We present an agentic system for natural language querying and statistical analysis of TCGA clinical data. The system uses a large language model as an autonomous ReAct agent that selects from eight computational tools, including data extraction, descriptive statistics, Kaplan-Meier survival analysis with log-rank tests, hypothesis testing, and verification against the curated TCGA Pan-Cancer Clinical Data Resource (CDR). The agent reasons about intermediate results, adapts its approach, and returns clinically contextualized responses with source attribution and auditable traces. We introduce TCGA-Agent-Bench, 440 queries across five difficulty tiers with ground truth from the independently curated TCGA-CDR, evaluated with dual metrics of numerical accuracy and clinical completeness. The system achieves 93.4% overall accuracy (100% single-patient lookups, 99.1% cohort statistics, 92.8% comparative analyses), outperforming a fixed rule-based pipeline (87.1%), a single-pass LLM (81.8%), and retrieval-augmented generation (66.9% on a subset). Most of the benchmark is answerable from the CDR alone, so we locate the extraction layer's value in fields the CDR lacks (drug treatments, TNM components, biomarkers, biospecimen metadata): on 26 queries targeting these, the full system answers 100% versus 3.8% for CDR-only. Ablations show the reasoning loop is most impactful (+9.1% accuracy, +22.0 completeness points). A tool-based agentic architecture enables accurate, auditable analysis of clinical repositories, with value driven by tool design and recovered fields rather than model scale.
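
One of the tool types listed above is Kaplan-Meier survival analysis. As an illustration of what such a tool computes, here is a minimal pure-Python product-limit estimator; the function name and toy data are assumptions, and this is not the system's actual tool code.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per patient
    events: 1 if the event (e.g. death) was observed at that time, 0 if censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # both events and censored patients leave the risk set
    return curve


if __name__ == "__main__":
    # Four toy patients; the second is censored at time 2.
    for t, s in kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]):
        print(t, s)
```

Censored patients contribute to the risk set up to their last follow-up without lowering the curve, which is why the estimate at time 3 in the toy data is 0.75 × 0.5 rather than 0.5.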

15

The Variance-Stabilizing Transformation for the Poisson Rate Ratio: Closed-Form Confidence Intervals

Ng, S.-P.

2026-07-18 epidemiology 10.64898/2026.07.16.26358255 medRxiv

Top 9%

0.3%

The incidence rate ratio R is the standard measure for comparing event rates in clinical trials and epidemiology. In vaccine trials, the vaccine efficacy is VE = 1 - R. When events are rare, the two arm counts are Poisson. The estimator of R is heteroskedastic: its sampling variance changes with the data. So no fixed-width interval covers correctly everywhere. The usual log-Wald interval is undefined at zero events and covers poorly at small counts. Early vaccine and drug-safety readouts fall in exactly this regime. We show that a single reparameterization collapses this bivariate problem to an effective one-parameter family with a quadratic variance function, whose variance-stabilizing transformation is 2 arcsinh(sqrt(R)). The reduction yields a closed-form confidence interval for R. Its two leading errors, a curvature bias and the variability of the estimated scale, each admit a closed-form correction with no tuning constants. In a Monte Carlo study of our seven arcsinh variants and five competitors, the +Curve+Stu variant covers within 0.002 of the nominal 0.95 for about 50 control and 5 treatment events. Its width is on par with the best competitor. It avoids the conservatism and zero-count breakdown of log-Wald and MOVER. For moderate counts, we recommend this interval; for sparser data, our Bar-Lev and Enis count-shift variant is more robust. The result is a ready-to-use, closed-form interval for the low-count regime. We illustrate it on early Covid-19 vaccine-efficacy readouts and provide reference implementations in R and Python.
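
A quick numerical check of the stated transformation, assuming the quadratic variance function is proportional to R(1 + R) (the abstract does not give its exact form). A variance-stabilizing transformation g satisfies g'(R) ∝ 1/sqrt(V(R)), so under that assumption g(R) = 2 arcsinh(sqrt(R)) should have derivative 1/sqrt(R(1 + R)). The function names below are illustrative.

```python
import math


def vst(rate_ratio):
    """Transformation quoted in the abstract: 2 * arcsinh(sqrt(R))."""
    return 2.0 * math.asinh(math.sqrt(rate_ratio))


def inverse_vst(z):
    """Map a value on the transformed scale back to a rate ratio."""
    return math.sinh(z / 2.0) ** 2


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


if __name__ == "__main__":
    for r in (0.1, 0.5, 1.0, 2.0):
        expected = 1.0 / math.sqrt(r * (1.0 + r))  # assumed V(R) = R(1 + R)
        print(r, numeric_derivative(vst, r), expected)
```

On the transformed scale the sampling variance is approximately constant, so an interval can be built symmetrically there and mapped back with `inverse_vst`; vaccine efficacy then follows as VE = 1 - R.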

16

ReCo: a self-configuring and self-extending agentic framework for biomedical research

Tzanis, E.; Klontzas, M. E.

2026-07-16 health informatics 10.64898/2026.07.14.26358025 medRxiv

Top 9%

0.3%

This study presents ReCo (Research Cosmos), a self-configuring and self-extending agentic research framework for the biomedical domain. ReCo is orchestrated by a large language model that interacts with native computing tools, bundled Model Context Protocol (MCP) servers, structured skills, persistent project memory, and a desktop interface. Its bundled MCP servers provide biomedical analysis capabilities while serving as implementation paradigms for integrating new computational and AI frameworks. Structured skills encode procedures for environment configuration and framework ingestion, enabling ReCo to inspect repositories, manuscripts, or local codebases; identify dependencies and execution patterns; create isolated runtime environments; and design and implement MCP interfaces. Self-extension was evaluated using five heterogeneous systems: the Merlin computed tomography foundation model, MAISI-v2 medical image synthesis framework, asari liquid chromatography-mass spectrometry workflow, DosimeTron agentic radiation-dosimetry platform, and Orthanc DICOM server. ReCo successfully operationalized all five systems and completed predefined functional evaluations. Re-hosted DosimeTron outputs demonstrated near-perfect agreement with the reference pipeline across 651 organ observations (Pearson correlation and Lin concordance correlation coefficient, 0.99999; mean absolute percentage difference, 0.37%). Notably, ReCo configured Orthanc as a PACS-like coordination layer, integrated it with DosimeTron, Merlin, and TotalSegmentator, and orchestrated data retrieval, analysis, and return of valid DICOM RTSTRUCT, RTDOSE, and Structured Report objects. ReCo provides a unified environment for configuring, documenting, and operationalizing heterogeneous biomedical frameworks, reducing technical barriers to the adoption and integration of emerging computational and AI methods. The official open-source ReCo GitHub repository is available at: https://github.com/eltzanis/ReCo

17

A New Method to Predict the Effect of an Intervention in the Host Population to Reduce the Magnitude of an Outbreak of a Vector-Borne Infection

Coutinho, F. A. B.; Amaku, M.; Kallas, E. G.; Massad, E.

2026-07-19 epidemiology 10.64898/2026.07.16.26358272 medRxiv

Top 9%

0.2%

In this paper, we propose a new model to estimate the impact of an intervention on human hosts of a vector-borne infection, such as dengue, which occurs in yearly outbreaks of different magnitudes. The model applies to these outbreaks and, in fact, is independent of their intensity, that is, it does not require the steady-state assumption. The model takes as input the officially reported age-dependent number of cases of a vector-borne infection. It is deterministic and does not account for stochasticity. Our objective is to estimate the impact of the intervention (the efficacy), and we rely on the observed fact that the age distribution of the proportion of cases of the infections transmitted by the same vector is independent of both the intensity of transmission and the geographic area studied, at least for Brazilian regions. This finding is highlighted in the main text and forms the basis of our calculations. A hypothetical intervention is simulated using a dengue vaccine, which allows the determination of the optimal strategy for a vaccination campaign.

18

Storing >1 byte of information in 16S ribosomal RNA using orthogonal trans-splicing ribozymes

Dysart, M. J.; Fang, L.; Karinje, L. K.; Chappell, J.; Stadler, L. B.; Silberg, J. J.

2026-07-15 synthetic biology 10.64898/2026.07.14.738544 medRxiv

Top 9%

0.2%

Catalytic-RNA (cat-RNA) expressed from mobile DNA can record cellular events, such as the uptake of plasmids via horizontal gene transfer, by splicing a barcode onto 16S ribosomal RNA (rRNA) - a system termed RNA addressable modification (RAM). However, scaling RAM to record multiple simultaneous biological events requires large numbers of orthogonal cat-RNA whose signals reflect the biological features under investigation rather than variability arising from the barcode sequence. Here, we explore how to design orthogonal cat-RNA to record information about multiple plasmid-encoded traits in parallel. We show that cat-RNA having tRNA-derived barcodes with sequence variation in the anticodon stem-loop present greater signal consistency within Escherichia coli than mRNA-derived barcodes. When orthogonal cat-RNA designs harboring tRNA-derived barcodes were evaluated in Vibrio natriegens and Pseudomonas putida, increased variance was observed compared with Escherichia coli. Nevertheless, the signal consistency was sufficient to use these orthogonal cat-RNAs to report on the relative activities of four promoters and two origins of replication by sequencing barcoded-rRNA derived from the three organisms. These results show how RAM can be multiplexed to report on mobile DNA features in microbial communities and illustrate the importance of accounting for variability in RNA outputs when designing and interpreting multiplexed RNA barcoding data.

19

Malaria Pre-screening Technology Using Artificial Intelligence (AI)

Ibeto, O. O.; Nwoye, E. O.

2026-07-17 infectious diseases 10.64898/2026.07.15.26357432 medRxiv

Top 9%

0.2%

Malaria remains a severe health problem in endemic regions because people lack adequate diagnostic tools, leading to delayed medical care and elevated death rates. This research introduces a dual-mode artificial intelligence system that uses two complementary models to enhance malaria pre-screening and diagnosis. The patient-centered model uses multivariate logistic regression to analyze biosignals, including heart rate, body temperature, and oxygen saturation, collected through a wearable sensor prototype and a mobile interface for symptom analysis. It lets patients run a self-assessment to gauge their level of need before scheduling a doctor's appointment. The clinician-centered model is a customized convolutional neural network that uses annotated microscopy images of red blood cells and achieves 94.84% accuracy, 95.71% precision, 93.87% recall, a 94.78% F1 score, and an area under the curve (AUC) of 0.84. The patient model achieved 94.6% accuracy and an AUC of 0.985 using a 70/30 train-test split. The two models form a layered diagnostic system whose parts can operate independently or together to detect malaria at an early stage, especially in areas with limited resources. The findings demonstrate that wearable biosignal data integration with image-based deep learning can produce dependable, scalable, and user-friendly systems for malaria pre-screening. Keywords - malaria diagnosis, artificial intelligence (AI), convolutional neural networks (CNN), wearable biosensors, multivariate logistic regression
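As a quick consistency check on the metrics quoted for the clinician-centered model, the F1 score can be recomputed from the reported precision and recall as their harmonic mean. The sketch below uses only the two percentages stated in the abstract; the function name is hypothetical.

```python
def f1_from_precision_recall(precision, recall):
    """F1 score: harmonic mean of precision and recall (both as fractions)."""
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, converted from percentages to fractions.
reported_precision = 0.9571
reported_recall = 0.9387

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"Recomputed F1: {100 * f1:.2f}%")
```

The recomputed value rounds to the 94.78% given in the abstract, so the three figures are mutually consistent.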

20

Rationale and guidance for implementing the continual reassessment method for dose-finding in controlled human infection model studies

Weerasinghe, C.; Osowicki, J.; Simpson, J. A.; Crocker-Buque, T.; McCarthy, J.; Williams, E.; Price, D. J.

2026-07-17 infectious diseases 10.64898/2026.07.16.26358128 medRxiv

Top 10%

0.2%

Controlled human infection models (CHIMs) are increasingly used in infectious disease research to study pathogen dynamics and evaluate interventions under controlled conditions. However, these studies are resource-intensive and involve ethical and safety constraints, making efficient study design critical. Dose-finding is a key early component in CHIMs, where the aim is to identify a challenge dose that achieves a target infection probability. Traditional rule-based designs are commonly used but can be inefficient, motivating the use of model-based adaptive approaches such as the Bayesian Continual Reassessment Method (CRM). Although CRM has been extensively studied and widely adopted in Phase I oncology trials for identifying the maximum tolerated dose of therapeutics, its application in CHIM settings remains limited, particularly when the endpoint of interest is infection. This tutorial provides step-by-step guidance for implementing a Bayesian CRM in dose-finding CHIMs, using an oropharyngeal Neisseria gonorrhoeae challenge as a motivating case study. The framework outlines key design components, including dose-grid specification, dose-response model, prior elicitation, Bayesian updating, decision rules, and stopping criteria, with particular emphasis on a clinically interpretable parameterisation. Trial operating characteristics are evaluated through simulation studies under multiple dose-response scenarios and prior-predictive analyses, and compared with a commonly used '3+3' type rule-based design. This work highlights the advantages of Bayesian model-based designs for dose-finding in CHIMs over classic rule-based designs and provides a structured, reproducible framework for implementing CRM, supporting their application in future CHIM studies.
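The Bayesian updating and decision-rule steps named above can be sketched with the textbook one-parameter "power" CRM. This is an illustrative assumption, not the clinically interpretable parameterisation the tutorial describes: the skeleton (prior guesses of infection probability per dose), the normal prior with variance 1.34, the grid approximation, and all function and parameter names are hypothetical choices for this example.

```python
import math


def crm_recommend(skeleton, doses_given, infected, target,
                  prior_sd=math.sqrt(1.34), grid_size=2001):
    """One-parameter power-model CRM on a grid (illustrative sketch).

    Model (assumed): P(infection at dose i) = skeleton[i] ** exp(a),
    with prior a ~ Normal(0, prior_sd**2).
    doses_given: dose index per participant; infected: 0/1 outcome per participant.
    Returns (index of recommended dose, posterior mean infection probabilities).
    """
    lo, hi = -5 * prior_sd, 5 * prior_sd
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]

    def prob(i, a):
        # Clamp to keep log() finite when the power underflows.
        p = skeleton[i] ** math.exp(a)
        return min(max(p, 1e-12), 1 - 1e-12)

    # Unnormalised posterior weight at each grid point: prior times likelihood.
    weights = []
    for a in grid:
        log_w = -0.5 * (a / prior_sd) ** 2
        for d, y in zip(doses_given, infected):
            p = prob(d, a)
            log_w += math.log(p) if y else math.log(1 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)

    # Posterior mean infection probability at every dose level.
    post_mean = [
        sum(w * prob(i, a) for a, w in zip(grid, weights)) / total
        for i in range(len(skeleton))
    ]

    # Decision rule: dose whose estimated probability is closest to the target.
    best = min(range(len(skeleton)), key=lambda i: abs(post_mean[i] - target))
    return best, post_mean
```

A call such as `crm_recommend([0.1, 0.3, 0.5, 0.7, 0.9], [2, 2, 2], [1, 1, 0], target=0.7)` would update the dose-response curve after three hypothetical participants challenged at the third dose level and return the dose to give the next cohort. A real design would add the safety constraints and stopping criteria the tutorial discusses, such as not skipping untried doses.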